Lending Club Case Study

Problem Statement:

You work for a consumer finance company which specialises in lending various types of loans to urban customers. When the company receives a loan application, the company has to make a decision for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

1. If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company

2. If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company

image.png

Contents of Notebook

  1. Data Understanding :

    Target Variable is Loan Status Feature

  2. Data CleanUp:

    Analyzed the Rows with lost of missing values , Rows with single Unique Values, Duplicate Rows, Highly Correlated Features

3.Feature Extraction:

Based on Type Driven, Data Driven, Busines Driven

4.Imputation & Outlier Treatment

5.Encoding of Categorical Features

6.Utility Functions:

plotMissingColumns(data) ,uniqueVsMissingDataMapping(data),convertToYear(x):, convertToMonth(x):, convertFeatureToDateTime(data, colName, colPrefix) , showchargedoffpropation(data,col),segmentedUnivariateAnalysis(data,col)

7.Univariate / Segmented / Bivariate Analysis of Features

Note: Color Coding used in NoteBook

header has different color based on :

Purpose Color
Section Division
High Relevant Features
Low Relevant Features
No Relevance Features

Part 1 : Data understanding

Target Feature For Analysis

Based on the dataset for Analysis we have the

1.Target Variable as Loan Status where

a. Charged off : means loan is default one and customer has not paying the Loan amount back to the Firm.
b. Fully Paid : the Loan has been successfully closed and borrower has returned all money ( Principal + Interest) back to Firm.
c. Current : Still Borrower is returning back to firm & its still not completed and its a ongoing when the data has been caputured. We will drop it off as it won't help in analysis of Loan defaulter

Data Cleaning & Manipulation

1. Check Missing Values in Dataset

Function Created For Ploting Missing Data

Observation : There are lots of Missing Data in Dataset

1.Drop the Column which has all more than 50% columns values as missing.

Now we left with only few Features which is having Missing Values

2. Lets analyse the Duplicate Columns header

3. Lets analyse Based on Duplicate Rows

4. Lets analyse the unique values in each Columns

Function Created For Ploting Unique Vs Missing Data
Observation
1.Green Line touching bottoms : There are some features whose unique count reaches to total no. of rows so with every column value as a unique value won't contribute to the any relation ship with other features
2.Blue Line intersecting with Green Lines : There is one feature desc whose missing value and unique count overlap so potential candidate to observe that do we need column r drop it
3.Green Lines at top : There are some feature whose unique value is 1 , so it will have constant relation with all features in all rowso may be we can drop this.

a. Unique Count equals to Row Count : Drop Features

b.Check Features whose Unique Values ~ Missing Value ( both are very high): Desc

c. Drop the Columns whose Unique Value = 1

Observation

Now there are only 5 columns where data are missing

Segregate Features Based on Business Values

Feature data which will be available for Customer when they apply for Loan: Loan Amount ,Interest Rate ,Term ,Annual Income, Loan Grade, Debt to Income Ratio, Applicant Income, Loan Purpose, Revolving Utilization of Credit, Employee Length

Feature data which will be not be available for Customer when they apply for Loan: 'out_prncp','total_pymnt','total_pymnt_inv','total_rec_prncp','total_rec_int','total_rec_late_fee','collection_recovery_fee','last_pymnt_d','last_pymnt_amnt','last_credit_pull_d'

Heat Map For correlations between Features

Observation

There are some columns in the data set which have high correlation so lets analyse those and if required may be we can drop some of them as feature with high correlation and so same impact on target variable will have.

Feature Reduction based on Correlation

Features having high corelations ( loan_amnt, funded_amnt, funded_amnt_inv)

Observation

1.loan_amnt : The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

2.funded_amnt: The total amount committed to that loan at that point in time.

3.funded_amnt_inv : The total amount committed by investors for that loan at that point in time.

Serve the same purpose that total amount given to borowwer

Observation

All there statistical plots looks simmilar for highly correlated Features

Delete funded_amnt_inv, funded_amnt as loan_amnt is highly correlated to it

Features having high corelations ( out_prncp, out_prncp_inv)

Details about the Column Data Types

Feature Extraction : Type Driven

Convert Term ( dtype : Object) in integer Columns

Convert Int_rate ( dtype: Object) in float Columns

Convert revol_util (dtype : Object) in float Columns

Convert Object into DateTime For Feautre : earliest_cr_line

Function Created For Extracting Years & Month

Convert Object into DateTime For Feautre : issue_d

Feature Extraction : Business Driven

Analyse the Target Feature : Loan_status

Observation

There are 14.2 % records in dataset which results to Charged off Loan Status (default loan)

seems to be Skewed data as there are only 14.2% of data available for analysis of loan defaults. real world application may be we need more instances of Charged off rows if its there

Encode the loan_status (Categorical variable ) to ( 0/1)

Lets first analyse Object Data Types (Categorical Features)

Feature Analysis : Univariate / Segmented / Bivariate

loan_amnt : (Continous Variable)

Observation

Most Loan Amount lies between 5K to 15K and there seems to be some outlier lets analyse it
Min Loan Amount : 500 Max Loan Amount : 35000

Outlier Check

Observation

Loan amnt Q1 & Q4 lies between 5000 to 15000 and charged off loans are also max in this range only this may be impact of the data set i.e. higher dataset details are available from here only.

Lets further analyse via segmented Analysis

Function For Segmented Univariate Analysis

Observation

In Each Quartile there is almost equal distribution of the Loan Amount Frequency

As Loan Amount Increases, both Charged Off Loan Count and Percentage increases, so Loan_amnt has impact on loan charged off condition

On the other side As loan Amount Increase the Fully Paid loan Percentage decreases.

Observation

Each Year company is giving loans of higher total loan amount, and may increase the chances of get loans charged off.

Term ( discreate Variable)

Observation

Mostly loan is taken for lower term i.e. 36 months

Observation

Most Loans are given for the 36 Months but there is a higher percentage of loans which is getting charged off belongs to 60 Months Loans i.e. 25%

14% difference of Loans gettting charged off from 60 Months to 36 Months. As Loan Term increase greater chances of Loans getting charged off

Observation

Loans of higher amount has been given for longer time period (60 Months) and with increase in loan amount and loan term there is greater chances of Loans getting charged off.

Pivot Table

int_rate (continous Variable)

Observation

Most Loans are given on the interest rate of 8.94% to 14.38%

Min interest Rate is 5% and Max is 25%

Outlier Check

Observation

Charged Off Loans Count and Percentage Increases with Loan Interest Rate.

Fully Paid Loans Count and Percentage Decreases with Loan Interest Rate.

Observation

60 months loans are given at higher interest rate and higher interest rate loans tends more to be charged off loan

Observation

int_rate between 10 to 15% has highest chance of charged off

Observation

int_rate incerease with incerease of funded amnt, but there is wide dark spread so 5000 rs loan amnt we can get in different rate of int_rate its not fixed , it will depend on term of loan

installment (discrete numerical Variable)

Observation

higher the loan amount more is the installement

on higher installment, charged off loans count is more compared to lower installments ( major lies between 400 to 800 installments period)

Based on Scatter Plot , does not seems to have any outlier as such, as for every installement period there is good amount of loan exist in that bucket

grade ( Categorical Variable)

Observation

Loan of A & B grade are Given most

Function to Show Chareged off Loans Percentage
Observation

there is a significance of grade on loan charged off ; i.e grade A & B loan is less charged off than grade B , C, D, E

loans of grade A & B is given more frequently than other grade

Chances of Getting Charged Off : Grade G >F > E> D> C > B > A

Observation

grade which is given frequently is A, B, C is of lower amount / installments than grade E, F, D, G

Max loss to lending firm has to suffer from grade B > C > D > E > F > A > G

Grade G Loans are given to high annaual income People.

Observation

Higher grade Loans has higher int rate

Observation

For 36 month grade A loan is given most

For 60 month grade B loan is given most

Loan count decerease as we go from A->B->C->D->E->F->G

Sub_grade ( Categorical Variable)

Observation

Sub_grade show same pattern as of grade Feature over charged off percentage i.e. increase from A to G, and as this is a categorical variable with lots of category may be we can think of dropping it off

Let Encode the grade & sub-grade ( from Label to dicreate numerical value)

emp_title (dtype: Object)

Observation

There are around 818 rows where person belongs to a company where their is both record of charged as well Fully paid

there are miss spelling of the company name as well e.g well fargo, wells fargo bank

there are 2K missing records

emp_length ( Categorical Variable)

Observation

10+ years people get most loans

Observation

There is no direct impact of no.of years of person job on loan charged off

Observation

No salary does not depend on no of years of experience

there is no direct correlation between exp_year and salary, just a slight impact we can see

Imputation of missing values in emp_length

Encode emp_length ( Categorical Variable) to discreate numerical value

home_ownership (categorical variable)

Observation

There is no significance difference of charged off loans Percentage based on Home Ownership.

Observation

Borrower having rent or Mortage ownership have taken the loans mostly

Observation

Mortage get the max loan amnt funded

Mortage and None have max annual income

annual_inc (Continous Variable)

Observation

Most people getting loan belongs to range of 0 ~ 2 le6 and there are some outliers, lets analyse it

Outlier Check

Outlier Removal
Observation

As Annual Income Increase chances of Loans getting charged off decreases and fully paid loans increases

Observation

Intially loan is given for high income group but now trend is limited to 67500 to 70000

Observation

40K to 60K have highest charged off then -->20k to 40 k --> 60 to 80

verification_status ( Categorical Variable)

Observation

Verified loan are more charging off than non verified , so need to recheck the verification process

Observation

Most loans are given to not verified

purpose (Categorical Variable)

Observation

For Purpose : small_business, debt_consolidation, credit_card, house are more contributing to the loan amount

Observation

For other & major_purpose loan amount is less but contributing more to charged off

debt Consolidation,other,credit_card : are most contributing to charged_Off loans

Observation

out of total loans given for small business , 25% of small business loans are getting charged off

Observation

Higher amount of Loan taken for purpose: Credit card, debt, house..

title (dtype: Object)

Observation

if we see purpose is divided into sub group and Debt Consolidation, Debt Consolidation Loan, debt consolidation are different values, lets drop this

zip_code (dtype : Object))

Observation

zip_code is encoded columns so we don't have the exact values , so lets drop this Feature

addr_state (dtype: Object)

Observation

NE ,NV states are contributing more to the Charged off Loans

dti (Continous Variable)

Observation

dti is distributed over a range of 0 to 30 and having max from 10 to 20 ( 25 percentile to 75 percentile)

as in second graph dti for fully paid and charged off have same infullence,

slightly higher dti tends to high chance of charged off

Observation

Higher dti tends to higher chances of Loans charged off and lower chances of Fully Paid

Observation

higher dti , will get chance of getting loan of higher amount

higher dti, will get loans at low interest rate like howe laon..

Observation

No significance of dti over annual income, spread is uniform across

issue year & issue month (dtype: datetime)

Observation

Charged off Loans are increasing over each year & month

Observation

Every year & every month charged off loan count is increasing as well total loss is also increasing

Observation

int_rate increasing with issue year

pub_rec_bankruptcies (discrete numerical variable)

Observation

higher the bankrupties higher the chances off charged off

Observation

As the bankrupcies increases chances of getting Loans decreases

Imputation of missing values in pub_rec_bankruptcies

pub_rec (discrete numerical variable)

Observation

higher the pub_rec higher the chances of loan default

recoveries (continous variable) ( Business Driven Feature Extraction)

Observation

Most Recovery Done From Lower Income Range Borrower

Observation

very less recovery is done... more annaul inc so more recovery, but line curve is almost straight

delinq_2yrs (discreate numerical Variable)

Observation

seems higher the delinq_2_yrs higher is the chance( 8 > 7 > 5> 6)

inq_last_6mths ( discreate numerical variable)

Observation

it revelas the credit score : the large the inquiry made , lower is the cibil score and same here larger the inquiry made , greater is the chance of charged off loans

open_acc (discreate numerical variable)

Observation

No clear distinction but at the end , means for extremly large no of open acc, its contributing more to the charged off

revol_bal (discreate numerical variable)

Observation

Both the above plot seems almost same , so there is no significance of revol_bal on charged off loans

revol_util (Continous Variable)

Observation

As revol util increases chances of loans getting charged off get increased

Observation

higher revol util tens to have high chances of Charged off

Observation

revol_util at higher range ( > 40) are more contributing to charged off loans

Observation

There is a uniform distribution of revol_util

Imputation of missing values

total_acc (discreate numerical Variable)

Observation

As no of total acc increase chances of loans getting charged off decreases and fully paid get increases

Observation

both plot look similar, so not a significant impact of the Total Account on Loan Status

Multivariate Analysis

Observation

as loan amount increase int_rate and dti is uniformly spread out

Most Borrower are from lower income group

Will drop unused Features

Feature specific to particular loan row column, these values can't be used when customer applies for loan and decide that is it going to be charged off or not

Recheck Columns ( Missing Values & dtypes)

Observation

No Features Left with Missing values, now good for ML Models

Observation

Most Feature are now have dtype as int / float, the object dtype ( verification_status, purpose, addr_state) need to be encoded.

** END OF EDA ***